Initiatives by national libraries, institutions, and (inter-)national projects have led to increased efforts to preserve textual content - including data that is not born-digital - for future generations. These activities have resulted in novel initiatives to preserve cultural heritage through digitization. However, a systematic approach toward Textual Data Denoising (TD2) is still in its infancy and is commonly limited to a single dominant language (mostly English), whereas digital preservation requires a universal approach. To this end, we introduce FETD2, a “Framework for Enabling Textual Data Denoising via robust contextual embeddings”. FETD2 improves data quality by training language-specific data denoising models based on a small amount of language-specific training data. Our approach employs bi-directional language modeling in order to produce noise-resilient deep contextualized embeddings. In experiments, we show the superiority of our approach over the state of the art.
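As an aside, the bi-directional language modeling mentioned in the abstract can be illustrated with a minimal, ELMo-style sketch in PyTorch: a forward LSTM predicts the next token, a backward LSTM predicts the previous one, and the concatenation of their hidden states serves as a contextual token embedding. All names and dimensions below (BiLM, emb_dim, hidden_dim, the random toy batch) are illustrative assumptions and do not reflect the actual FETD2 implementation.

import torch
import torch.nn as nn

class BiLM(nn.Module):
    # Illustrative bidirectional language model (not the FETD2 code).
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fwd_out = nn.Linear(hidden_dim, vocab_size)
        self.bwd_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        emb = self.embed(tokens)
        fwd_h, _ = self.fwd_lstm(emb)                        # left-to-right context
        bwd_h, _ = self.bwd_lstm(torch.flip(emb, dims=[1]))  # right-to-left context
        bwd_h = torch.flip(bwd_h, dims=[1])
        contextual = torch.cat([fwd_h, bwd_h], dim=-1)       # deep contextual embedding
        return self.fwd_out(fwd_h), self.bwd_out(bwd_h), contextual

# Training signal: the forward LM predicts the next token, the backward LM the previous one.
model = BiLM(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 10))                     # toy batch of token ids
fwd_logits, bwd_logits, ctx = model(tokens)
loss_fn = nn.CrossEntropyLoss()
loss = (loss_fn(fwd_logits[:, :-1].reshape(-1, 1000), tokens[:, 1:].reshape(-1))
        + loss_fn(bwd_logits[:, 1:].reshape(-1, 1000), tokens[:, :-1].reshape(-1)))
loss.backward()

The resulting contextual embeddings (ctx) could then feed a downstream denoising component; again, this is only a sketch of the general technique named in the abstract.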
Govind, C. Alec, J.-L. Manguin and M. Spaniol
FETD2: A Framework for Enabling Textual Data Denoising via Robust Contextual Embeddings
Proceedings of the 25th International Conference on Theory and Practice of Digital Libraries (TPDL 2021), virtual conference, September 13-17, 2021, 12 pages (to appear).